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Abstract 

A number of recent emerging applications call for studying data streams, potentially infinite 
flows of information updated in real-time. When multiple co-evolving data streams are observed, 
an important task is to determine how these streams depend on each other, accounting for dy- 
namic dependence patterns without imposing any restrictive probabilistic law governing this de- 
pendence. In this paper we argue that flexible least squares (FLS), a penalized version of ordinary 
least squares that accommodates for time-varying regression coefficients, can be deployed suc- 
cessfully in this context. Our motivating application is statistical arbitrage, an investment strategy 
that exploits patterns detected in financial data streams. We demonstrate that FLS is algebraically 
equivalent to the well-known Kalman filter equations, and take advantage of this equivalence to 
gain a better understanding of FLS and suggest a more efficient algorithm. Promising experimen- 
tal results obtained from a FLS-based algorithmic trading system for the S&P 500 Futures Index 
are reported. 

Keywords: Temporal data mining, flexible least squares, time-varying regression, algorithmic trad- 
ing system, statistical arbitrage 



1 Introduction 

Temporal data mining is a fast-developing area concerned with processing and analyzing high- volume, 
high-speed data streams. A common example of data stream is a time series, a collection of univariate 
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or multivariate measurements indexed by time. Furthermore, each record in a data stream may have a 
complex structure involving both continuous and discrete measurements collected in sequential order. 
There are several application areas in which temporal data mining tools are being increasingly used, 
including finance, sensor networking, security, disaster management, e-commerce and many others. 
In the financial arena, data streams are being monitored and explored for many different purposes 
such as algorithmic trading, smart order routing, real-time compliance, and fraud detection. At the 
core of all such applications lies the common need to make time-aware, instant, intelligent decisions 
that exploit, in one way or another, patterns detected in the data. 

In the last decade we have seen an increasing trend by investment banks, hedge funds, and pro- 
prietary trading boutiques to systematize the trading of a variety of financial instruments. These com- 
panies resort to sophisticated trading platforms based on predictive models to transact market orders 
that serve specific speculative investment strategies. 

Algorithmic trading, otherwise known as automated or systematic trading, refers to the use of 
expert systems that enter trading orders without any user intervention; these systems decide on all 
aspects of the order such as the timing, price, and its final quantity. They effectively implement pattern 
recognition methods in order to detect and exploit market inefficiencies for speculative purposes. 
Moreover, automated trading systems can slice a large trade automatically into several smaller trades 
in order to hide its impact on the market (a technique called iceberging) and lower trading costs. 
According to the Financial Times, the London Stock Exchange foresees that about 60% of all its 
orders in the year 2007 will be entered by algorithmic trading. 

Over the years, a plethora of statistical and econ ometric techniques have been developed to an- 



alyze financial data BDe Gooijer and Hyndma , 



2006]. Classical time series analysis models, such as 



ARIMA and GARCH, as well as many other extensions and variations, are often used to obtain in 



2004]. 



sights into the mechanisms that generates the observed data and make predictions BChatfield , 
However, in some cases, conventional time series and other predictive models may not be up to the 
challenges that we face when developing modern algorithmic trading systems. Firstly, as the re- 
sult of developments in data collection and storage technologies, these applications generate massive 
amounts of data streams, thus requiring more efficient computational solutions. Such streams are 
delivered in real time; as new data points become available at very high frequency, the trading sys- 
tem needs to quickly adjust to the new information and take almost instantaneous buying and selling 
decisions. Secondly, these applications are mostly exploratory in nature: they are intended to detect 
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patterns in the data that may be continuously changing and evolving over time. Under this scenario, 
little prior knowledge should be injected into the models; the algorithms should require minimal as- 
sumptions about the data-generating process, as well as minimal user specification and intervention. 

In this work we focus on the problem of identifying time-varying dependencies between co- 
evolving data streams. This task can be casted into a regression problem: at any specified point 
in time, the system needs to quantify to what extent a particular stream depends on a possibly large 
number of other explanatory streams. In algorithmic trading applications, a data stream may comprise 
daily or intra-day prices or returns of a stock, an index or any other financial instrument. At each time 
point, we assume that a target stream of interest depends linearly on a number of other streams, but 
the coefficients of the regression models are allowed to evolve and change smoothly over time. 

The paper is organized as follows. In section |2] we briefly review a number of common trading 
strategies and formulate the problem arising in statistical arbitrage, thus proving some background 
material and motivation for the proposed methods. The flexible least squares (FLS) methodology is 
introduced in Section [3] as a powerful exploratory method for temporal data mining; this method fits 
our purposes well because it imposes no probabilistic assumptions and relies on minimal parameter 
specification. In Section [4] some assumptions of the FLS method are revisited, and we establish a 
clear connection between FLS and the well-known Kalman filter equations. This connection sheds 
light on the interpretation of the model, and naturally yields a modification of the original FLS that is 
computationally more efficient and numerically stable. Experimental results that have been obtained 
using the FLS-based trading system are described in Section [5] In that section, in order to deal with 
the large number of predictors, we complement FLS with a feature extraction procedure that performs 
on-line dimensionality reduction. We conclude in Section |7] with a discussion on related work and 
directions for further research. 

2 A concise review of trading strategies 

Two popular trading strategies are market timing and trend following. Market timers and trend fol- 
lowers both attempt to profit from price movements, but they do it in different ways. A market timer 
forecasts the direction of an asset, going long (i.e. buying) to capture a price increase, and going 
short (i.e. selling) to capture a price decrease. A trend follower attempts to capture the market trends. 
Trends are commonly related to serial correlations in price changes; a trend is a series of asset prices 
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Figure 1 : Historical prices of Exxon Mobil Corporation and South West Airlines for the period 1997-2007. The 
spread time series, reported in the inset, shows an equilibrium level between the two prices until about January 
2004. 



that move persistently in one direction over a given time interval, where price changes exhibit positive 
serial correlation. A trend follower attempts to identify developing price patterns with this property 
and trade in the direction of the trend if and when this occurs. 

Although the time-varying regression models discussed in this work may be used to implement 
such trading strategies, we will not discuss this further. We rather focus on statistical arbitrage, a 
class of strategies widely used by hedge funds or proprietary traders. The distinctive feature of such 
strategies is that profits can be made by exploiting statistical mispricing of one or more assets, based 
on the expected value of these assets. 



The simplest special case of these strategies is perhaps pairs trading (see 



Elliott et al 



Gatev etal 



[2005], 



200611 ). In this case, two assets are initially chosen by the trader, usually based on an 



analysis of historical data or other financial considerations. If the two stocks appear to be tied to- 
gether in the long term by some common stochastic trend, a trader can take maximum advantage from 
temporary deviations from this assumed equilibrium Q. 

A specific example will clarify this simple but effective strategy. Figure Q] shows the historical 
prices of two assets, South West Airlines and Exxon Mobil; we denote the two price time series by 



This str ategy relies on the idea of co- 



presented in 



Alexander and Dimitriu 



integration. 



200211 and 



Severa l applications of cointegration-based trading strategies are 



Burgess 



2003] 
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y t and x t for t = 1, 2, . . . , respectively. Clearly, from 1997 till 2004, the two assets exhibited some 
dependence: their spread, defined as st = yt — %t (plotted in the inset figure) fluctuates around a long- 
term average of about —20. A trading system implementing a pairs trading strategy on these two assets 
would exploit temporary divergences from this market equilibrium. For instance, when the spread 
st is greater than some predetermined positive constant c, the system assume that the South West 
Airlines is overpriced and would go short on SouthWest Airlines and long on Exxon Mobil, in some 
predetermined ratio. A profit is made when the prices revert back to their long-term average. Although 
a stable relationship between two assets may persist for quite some time, it may suddenly disappear 
or present itself in different patterns, such as periodic or trend patterns. In Figure [T] for instance, the 
spread shows a downward trend after January 2004, which may be captured by implementing more 
refined models. 

2.1 A statistical arbitrage strategy 

Opportunities for pairs trading in the simple form described above are dependent upon the existence of 
similar pairs of assets, and thus are naturally limited. Many other variations and extensions exist that 
exploit temporary mispricing among securities. For instance, in index arbitrage, the investor looks 
for temporary discrepancies between the prices of the stocks comprising an index and the price of a 
futures contrac ^]on that index. By buying either the stocks or the futures contract and selling the other, 
market inefficiency can be exploited for a profit. 

In this paper we adopt a simpler strategy than index arbitrage, somewhat more related to pairs 
trading. The trading system we develop tries to exploit discrepancies between a target asset, selected 
by the investor, and a paired artificial asset that reproduces the target asset. This artificial asset is 
represented by a data stream obtained as a linear combination of a possibly large set of explanatory 
streams assumed to be correlated with the target stream. 

The rationale behind this approach is the following: if there is a strong association between syn- 
thetic and target assets persisting over a long period of time, this association implies that both assets 
react to some underlying (and unobserved) systematic component of risk that explains their dynam- 
ics. Such a systematic component may include all market-related sources of risk, including financial 
and economic factors. The objective of this approach is to neutralize all marker-related sources of 



"A futures contract is an obligation to buy or sell a certain underlying instrument at a specific date and price, in the 
future. 
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risks and ultimately obtain a data stream that best represents the target-specific risk, also known as 
idiosyncratic risk. 

Suppose that y t represents the data stream of the target asset, and y t is the artificial asset estimated 
using a set of p explanatory and co-evolving data streams x%, . . . , x p , over the same time period. In 
this context, the artificial asset can also be interpreted as the fair price of the target asset, given all 
available information and market conditions. The difference y t — yt then represents the risk associated 
with the target asset only, or mispricing. Given that this construction indirectly accounts for all sources 
of variations due to various market-related factors, the mispricing data stream is more likely to contain 
predictable patterns (such as the mean-reverting behavior seen in Figure []} that could potentially be 
exploited for speculative purposes. For instance, in an analogy with the pairs trading approach, a 
possibly large mispricing (in absolute value) would flag a temporary inefficiency that will soon be 
corrected by the market. This construction crucially relies on accurately and dynamically estimating 
the artificial asset, and we discuss this problem next. 



3 Flexible Least Squares (FLS) 

The standard linear regression model involves a response variable y t and p predictor variables x% , . . . , x p , 
which usually form a predictor column vector xt = (a; it, . . . , x p t)'. The model postulates that y t can 
be approximated well by x' t /3, where is a p-dimensional vector of regression parameters. In ordi- 
nary least square (OLS) regression, estimates (3 of the parameter vector are found as those values that 
minimize the cost function 

T 
i=l 

When both the response variable y t and the predictor vector xt are observations at time t of co- 
evolving data streams, it may be possible that the linear dependence between y t and x t changes and 
evolves, dynamically, over ti me. Flexible least squares were introduced at the end of the 80's by 



Tesfatsion and Ka laba [ 1989] as a generalization of the standard linear regression model above in 
order to allow for time-variant regression coefficients. Together with the usual regression assumption 
that 

y t - x' t Pt « (2) 

the FLS model also postulates that 

t +i - A ~ (3) 
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that is, the regression coefficients are now allowed to evolve slowly over time. 

FLS does not require the specification of probabilistic properties for the residual error in ©. This 
is a favorable aspect of the method for applications in temporal data mining, where we are usually 
unable to precisely specify a model for the errors; moreover, any assumed model would not hold true 
at all times. We have found that FLS performs well even when assumption © is violated, and there 
are large and sudden changes from (3 t -i to fit, for some t. We will illustrate this point by means of an 
example in the next section. 

With these minimal assumptions in place, given a predictor x t , a procedure is called for the es- 
timation of a unique path of coefficients, (j t = (f3' lt , . . . , /3' t )', for t = 1,2, — The FLS approach 
consists of minimizing a penalized version of the OLS cost function ©, namely^ 

T T-1 

C(A /i) = J>t - x'A) 2 + fi £ 6 (4) 



where we have defined 



6 = (fit+i 

and fi > is a scalar to be deter mined. 



■t+i 



(5) 



In their original formulation, 



Kalaba and Tesfatsionl jl988l1 propose an algorithm that minimizes 



this cost with respect to every (3 t in a sequential way. They envisage a situation where all data points 
are stored in memory and promptly accessible, in an off-line fashion. The core of their approach is 
summarized in the sequel for completeness. 

The smallest cost of the estimation process at time t can be written recursively as 



c(A+i; M) = inf {(lit ~ x'tPt) 2 + M& + c (Pf, m)} 
Furthermore, this cost is assumed to have a quadratic form 



(6) 



c(p t ; f i) = (3' t S t _ 1 [3 t -2P' t s t _ 1 + r t _ 



(V) 



where St-i and st-\ have dimensions p x p and p X 1, respectively, and r t -i is a scalar. Substituting 
© into © and then differentiating the cost © with respect to f3 t , conditioning on f3t+\, one obtains 
a recursive updating equation for the time- varying regression coefficient 



Pt = d t + Mtfit+i 



(8) 



~ This cost function is called the incompatibility cost in 



Tesfatsion and Kalaba 



1989] 
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with 

d t = /i _1 M t (s t _i + x t y t ) 
M t = (i(St-i + pip + xtx'ty 1 

The recursions are started with some initial So and so- Now, using (HI), the cost function can be written 

as 

c(Pt +1 ; n) = /3' t+1 S t +i - W't+\ s t + n 

where 

S t = KI P - M t ) (9) 
s t = ndt (10) 
r t = r t -i + Vt - (st-i + x t yt)'d t 

and where I p is the p x p identity matrix. In order to apply dSJ, this procedure requires all data points 



till tim e T to be available, so the coefficient vector j3r should be computed first. 



Kalaba and Tesfatsion 



1 198811 show that the estimate of (3t can be obtained sequentially as 

0r = (St-i + x T x' T )~ 1 (s T ^ l + xtVt) 

Subsequently, © can be used to estimate all remaining coefficient vectors 0r-i, ■ ■ ■ > Pi, going back- 
wards in time. 

The procedure relies on the specification of the regularization parameter p > 0; this scalar pe- 
nalizes the dynamic component of the cost function ©, defined in ©, and acts as a smoothness 
parameter that forces the time-varying vector towards or away from the fixed-coefficient OLS solu- 
tion. We prefer the alternative parameterization based on \x = (1 — S)/S controlled by a scalar 5 
varying in the unit interval. Then, with S set very close to (corresponding to very large values of p), 
near total weight is given to minimizing the static part of the cost function (01). This is the smoothest 
solution and results in standard OLS estimates. As 5 moves away from 0, greater priority is given to 
the dynamic component of the cost, which results in time-varying estimates. 

3.1 Off-line and on-line FLS: an illustration 

As noted above, the original FLS has been introduced for situations in which all the data points are 
available, in batch, prior to the analysis. In contrast, we are interested in situations where each data 
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Figure 2: Simulated versus estimated time-varying regression coefficients using FLS in both off-line and on- 
line mode. 

point arrives sequentially. Each component of the p dimensional vector x% represents a new point 
of a data stream, and the path of regression coefficients needs to be updated at each time step so as 
to incorporate the most recently acquired information. Using the FLS machinery in this setting, the 
estimate of (3 t is given recursively by 

% = {St-i + x t x t y l {s t ^i + x t y t ) (1 1) 

where, by substituting M t and dt in (© and (flOl ). we obtain the recursions of St and st as 

S t = n{S t -i + \il v + x t x' t )~ l (S t -i + xtx't) (12) 
s t = /J,(S t -i +nl p +x t x' t )~ 1 (st-i +x t y t ) 

These recursions are initially started with some arbitrarily chosen values So and so- 

Figure |2] illustrates how accurately the FLS algorithm recovers the path of the time- varying co- 
efficients, in both off-line and on-line settings, for some artificially created data streams. The target 
stream y t for this example has been generated using the model 

y t = x t (3 t + e t (13) 
where et is uniformly distributed over the interval [—2,2] and the explanatory stream xt evolves as 

x t = 0.8a;t_i + z t 
9 



with Zt being white noise. The regression coefficients have been generated using a slightly complex 
mechanism for the purpose of illustrating the flexibility of FLS. Starting with f3\ = 7, we then generate 

0t as 



Pt-i + at 

Pt-i + 4 

Pt-i + h 
5sin(0.5t) + c t 



for t = 2, . . . , 99 
for t = 100 
fort = 101,..., 200 
fort = 201,... ,300 



where at and bt are Gaussian random variables with standard deviations 0.1 and 0.001, respectively, 
and c t is uniformly distributed over [—2,2]. We remark that this example features non-Gaussian error 
terms, as well as linear and non-linear behaviors in the dynamics of the regression coefficient, varying 
over time. 

In this example we set 5 = 0.98. Although such a high value of 5 encourages the regression 
parameters to be very dynamic, the nearly constant coefficients observed between t = 101 and t = 
200, as well as the two sudden jumps at times t = 100 and t = 201, are estimated well, and especially 
so in the on-line setting. The non-linear dynamics observed from time t = 201 onwards is also well 
captured. 



4 An alternative look at FLS 

In section [3l we have stressed that FLS relies on a quite general assumption concerning the evolution 
of the regression coefficients, as it only requires (3 t +i — fit to be small at all times. Accordingly, 
assumption ((3]) does not imply or require t hat each vector @ t is a random vector. Indeed, in the 



original work of 



Kalaba and Tesfatsion 



1 198811 . {/3t} is not treated as a sequence of random variables, 
but rather taken as a sequence of unknown quantities to be estimated. 

We ask ourselves whether we can gain a better understanding of the FLS method after assuming 
that the regression coefficients are indeed random vectors, without losing the generality and flexibility 
of the original FLS method. As it turns out, if we are willing to make such an assumption, it is 
possible to establish a neat algebraic correspondence between the FLS estimation equations and the 
well-known Kalman filter (KF) equations. This correspondence has a number of advantages. Firstly, 
this connection sheds light into the meaning and interpretation of the smoothing parameter fx in the 
cost function (01). Secondly, once the connection with KF is established, we are able to estimate 
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the covariance matrix of the estimator of ft t . Furthermore, we are able to devise a more efficient 
version of FLS that does not require any matrix inversion. As in the original method, we restrain from 
imposing any specific probability distribution. The reminder of this section is dedicated to providing 
an alternative perspective of FLS, and deriving a clear connection between this method and the well- 
known Kalman filter equations. 

4.1 The state-space model 

In our formulation, the regression coefficient at time t+ 1 is modeled as a noisy version of the previous 
coefficient at time t. First, we introduce a random vector io t with zero mean and some covariance 
matrix V w , so that 

P t+1 =p t + co t t = 0,l,...,T-l. (14) 

Then, along the same lines, we introduce a random variable e t having zero mean and some variance 
V e , so that 

y t = x' t p t + e t t = l,...,T. (15) 

Equations (fT4l and ([TBI , jointly considered, result in a linear state-space model, for which it is as- 
sumed that the innovation series {et} and {u>t} are mutually and individually uncorrected, i.e. ej is 
uncorrected of ej, u>i is uncorrected of toj, and is uncorrected of ujg, for any i ^ j and for any 
k,£. It is also assumed that for all t, e t and uj t are uncorrected of the initial state /3 - It should be 
emphasized again that no specific distribution assumptions for et and uj t have been made. We only 
assume that et and ujt attain some distributions, which we do not know. We only need to specify the 
first two moments of such distributions. In this sense, the only difference between the system specified 
by (fT4l)-(fT5l) and FLS is the assumption of randomness of fit- 

4.2 The Kalman filter 



The Kalman filter I Ka lman . 



1960] is a powerful method for the estimation of (3t in the above linear 
state-space model. In order to establish the connection between FLS and KF, we derive an alternative 
and self-contained proof of the KF recursions that make no assumptions on the distributions of et and 



LOf We have found relat ed proof s of su ch recursions that do not rely on probabilistic assumptions, 



as m 



Kalman [ 1960] and 



Eubank [2006]. In comparison with these, we believe that our derivation is 



simpler and does not involve matrix inversions, which serves our purposes well. 
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We start with some definitions and notation. At time t, we denote by t the estimate of (3 t and by 
yt+i = E(yt+i) the one-step forecast of yt+i, where E(.) denotes expectation. The variance of yt+x 
is known as the one-step forecast variance and is denoted by Q t = Var(y t+ i). The one-step forecast 
error is defined as e t = y t — E(y t ). We also define the covariance matrix of /3 t — j3 t as Pt and the 
covariance matrix of Pt — Pt-x as Rt and we write Cov(p t — Pt) = Pt an d Cov(/3 4 — Pt-x) = Rt- 
With these definitions, and assuming linearity of the system, we can see that, at time t — 1 

Rt = P t -i + V u 
Vt = x'tPt-i 
Qt = x' t R t x t + V e 

where P t _i and Pt-\ are assumed known. The KF gives recursive updating equations for P t and Pt 
as functions of Pt-i and Pt-i- 

Suppose we wish to obtain an estimator of p t that is linear in y t , that is p t = at + K t yt, for some 
at and K t (to be specified later). Then we can write 



Pt = a* + K t e t 



(16) 



with et = yt — x' t p t -\. We will show that for some K t , if Pt is required to minimize the sum of 
squares 

T 

C = ^2(y t -x' t Pt) 2 (17) 



t=i 



then a* t = fi t -\. To prove this, write Y = (yi, . . . , y T )', X = (x[, . . . , x' T )', B = (/?(,..., P' T )', 
£ = (ex,..., ex)' and 

( Kx ••• \ 
K 2 ■■■ 



K 



\ ••• Kt y 



Then we can write dTvT t as 



C = C(B) = (Y — XB)'(Y - XB) 
and B = A* + K£, where A* = ((af)', (a* T )')'. We will show that A* = B*, where B* 
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(P' Q , • • • , Pt-iY- With the above B, the sum of squares can be written as 

S(B) = (Y - XA* - XK£)'(Y - XA* - XK£) 

= (Y - XA*)'(Y - XA*) -2(Y - XA*)'XK£ 
+£'K'X'XK£ 

which is minimized when Y — XA* is minimized or when E(Y — XA*) = 0, leading to A* = B* 
as required. Thus, a* = ftt-i and from (fTBT ) we have 

% = + K t e t (18) 

for some value of K t to be defined. From the definition of P t , we have that 

P t = Cov (f3 t - + K t (x' t f3 t + e t - x0 t ^))) 
= Cov((/ p - K t x' t )((3 t - A_0 - K t e t ) 
= (I p -K t x' t )R t (I p -x t K' t ) + V e K t K' t 

= R t -K t x' t R t -R t x t K' t + Q t K t K' t (19) 
Now, we can choose K t that minimizes 

E(fit - %Y(Pt ~ Pt) 

which is the same as minimizing the trace of P t , and thus K t is the solution of the matrix equation 

dtrace(P f ) , 

— — = -2(x t R t ) + 2g 4 A t = 

where <9trace (Pt)/dK t denotes the partial derivative of the trace of Pt with respect to K t . Solving 
the above equation we obtain K t = R t x t /Q t . The quantity K t , also known as the Kalman gain, is 
optimal in the sense that among all linear estimators (3 t , ( fT8l > minimizes E(p t — Pt)'{Pt ~ Pt)- With 
K t = RtXt/Qt, from ( fT9l the minimum covariance matrix P t becomes 

P t = Rt- QtKK't (20) 

The KF consists of equations (fT8T ) and (1201 . together with 

K t = Rtxt/Qt 
R t = P t _i + K, 
Qt = x' t R t x t + K and 
et = Vt- x'tpt-l 
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Initial values for (5q and Pq have to be placed; usually we set (3q = and P 1 = 0. 
Note that from the recursions of Pt and Rt we have 

R t+1 = R t - Q t K t K' t + V u 



(21) 



4.3 Correspondence between FLS and KF 

Traditionally, the KF equations ar e derived under the assumption that e t and u>t follow the normal 



distribution, as in iJazwinskil |1970Q . This stronger distributional assumption allows the derivation of 
the likelihood function. When the normal likelihood is available, we note that its maximization is 
equivalent to minimizing the quantity 

T j T-1 

V, 



t=l 



t=l 



Jazwinskil j 1970(1 for a proof). The 



with respect to f3\, (5t, where & has been defined in (O (see 
above expression is exactly the cost function ((U) with \i replaced by l/V^. 

This correspondence can now be taken a step further: in a more general setting, where no distribu- 
tional assumptions are made, it is possible to arrive to the same result. This is achieved by rearranging 
equation (TTTT t in the form of (fT8l) . which is the KF estimator of f3 t . First, note that from (PT2l) we can 
write 



(S t -i + x t x' t ) 1 



(j,S t 1 (S t -i + fJ-Ip + x t x 



and substituting to (fTTT > we get j3t = S t 1 st- Thus we have 



A - Pt-i 



s t St-S^st-i 

(S t -i + x t x' t y 1 (s t -i + x t y t ) - S^st-i 

_i S^xtx'tS^st-i + x t y t ) 
St-xXtVt 



S t \xt 



x' t S t \x t + 1 
-{ytx' t S^\x t + y t 



x' t St\x t + 1 
-x' t St\s t -i - x'tS^xtyt) 
St\x t 



x t s t-i x t + 1 



(yt - x'tPt-i) = K t e t 



14 



with 



K t 



Rtxt/Qt 



»-i 
't-i 



+ 1 



1 



It remains to prove that the recursion of St as in ([T2T) communicates with the recursion of (|2TI) . for 
.Ri+i = S£" . To end this, starting from (fl2T) and using the matrix inversion lemma, we obtain 

Rt+i = Sf 1 = ^(St-i + x t x'^~ l {S t -\ + nip + x t x' t ) 



which is the KF recursion (|2TT) . where V u = fi Ip. 

Clearly, the FLS estimator (3 t of (fTTb is the same as the KF estimator t of (fPST ). From this 
equivalence, and in particular from V w = n~~ l I p , it follows that 



This result further clarifies the role of the smoothing parameter \i in (0]). As \x — > oo, the covari- 
ance matrix of @ t +i — Pt is almost zero, which means that f3 t +i = Pt, for all t, reducing the model 
to a usual regression model with constant coefficients. In the other extreme, when \i w 0, the covari- 
ance matrix of f3 t +i — flt has very high diagonal elements (variances) and therefore the estimated /3 t 's 
fluctuate erratically. 

An important computational consequence of the established correspondence between the FLS and 
the KF is apparent. For each time t, FLS requires the inversion of two matrices, namely St~\ + xtx' t 
and St-i + flip + xtx' t . However, these inversions are not necessary, as it is clear by the KF that (3 t 
can be computed by performing only matrix multiplications. This is particulary useful for temporal 
data mining data applications when T can be infinite and p very large. 

It is interesting to note how the two procedures arrive to the same solution, although they are 
based on quite different principles. On one hand, FLS merely solves an optimization problem, as it 



= A* 



i 



(I p + n(S t -i + x t x' t ) x ) 



= A 1 




Cov(A+i - fy) = -I p 
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minimizes the cost function C(/i) of @. On the other hand, KF performs two steps: first, all linear 
estimators are restricted to forms of (fT8T ). for any parameter vector Kt; in the second step, Kt is 
optimized so that it minimizes P t , the covariance matrix of fit — fit- This matrix, known as the error 
matrix of fi t , gives a measure of the uncertainty of the estimation of fi t . 

The relationship between FLS and KF has important implications for both methods. For FLS, it 
suggests that the regression coefficients can be learned from the data in a recursive way without the 
need of performing matrix inversions; also, the error matrix P t is routinely available to us. For KF, 
we have proved that the estimator fit minimizes the cost function C(/x) = C(l/V w ) when only the 
mean and the variance of the innovations e t and Lo t are specified, without assuming these errors to be 
normally distributed. 



5 An FLS-based algorithmic trading system 
5.1 Data description 

We have developed a statistical arbitrage system that trades S&P 500 stock-index futures contracts. 
The underlying instrument in this case is the S&P 500 Price Index, a world renowned index of 500 
US equities with minimum capitalization of $4 billion each; this index is a leading market indicator, 
and is often used as a gauge of portfolio performance. The constituents of this index are highly traded 
by traditional asset management firms and proprietary desks worldwide. The data stream for the S&P 
500 Futures Index covers a period of about 9 years,from 02/01/1997 to 25/10/2005. The contract 
prices were obtained from Bloomberg, and adjusteqj to obtain the target data stream as showed in 
Figure [3] Our explanatory data streams are taken to be a subset of all constituents of the underlying 
S&P 500 Price Index. The constituents list was acquired from the Standard & Poor's web site as of 
1st of March 2007, whereas the constituents data streams were downloaded from Yahoo! Financial. 
The constituents of the S&P index are added and deleted frequently on the basis of the characteristics 
of the index. For our experiments, we have selected a time-invariant subset of 432 stocks, namely all 
the constituents whose historical data is available over the entire 1997 — 2005 period. 

The system thus monitors 433 co-evolving data streams comprising one target asset and 432 ex- 
planatory streams. All raw prices are pre-processed in several ways: data adjustments are made for 



4 Futures contracts expire periodically; since the data for each contract lasts only a few weeks or months, continuous data 
adjustment is needed in order to obtain sequences of price data from sequences of contract prices. 
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S&P500 Futures Index 




Jan98 JanOO Jan02 Jan04 



Figure 3: S&P 500 Futures Index for the available 9-years period 

discontinuities relating to stock splits, bonus issues, and other financial events; missing observations 
are filled in using the most recent data points; finally, prices are transformed into log-returns. At each 
time t > 1, the log-return for asset i is defined as 

ru = logpu - logp i( i_i) i = 1, . . . , 432 

where pu is the observed price of asset i at time t. Taking returns provides a more convenient repre- 
sentation of the assets, as it makes different prices directly comparable and center them around zero. 
We collect all explanatory assets available at time t in a column vector r t . Analogously, we denote by 
at the log-return of the S&P 500 Futures Index at time t. 

5.2 Incremental SVD for dimensionality reduction 

When the dimensionality of the regression model is large, as in our application, the model might 
suffer from multicollinearity. Moreover, in real-world trading applications using high frequency data, 
the regression model generating trading signals need to be updated quickly as new information is 
acquired. A much smaller set of explanatory streams would achieve remarkable computational speed- 
ups. In order to address all these issues, we implement on-line feature extraction by reducing the 
dimensionality in the space of explanatory streams. 

Suppose that Rt = E(r t r' t ) is the the unknown population covariance matrix of the explanatory 



streams, with data available up to time t = 1, . . . , T. The algorithm proposed by 



Weng et al. 



[2003] 
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provides an efficient procedure to incrementally update the eigenvectors of the R t matrix as new data 
are made available at time t + 1. In turn, this procedure allows us to extract the first few principal 
components of the explanatory data streams in real time, and effectively perform incremental dimen- 
sionality reduction. 



A brief outline of the procedure suggested by 



Weng et al. 



1 2003] is provided in the sequel. First, 



note that the eigenvector g t of R t satisfies the characteristic equation 



h = Xt9t = Rt9t 



(22) 



where Xt is the corresponding eigenvalue. Let us call h t the current estimate of h t using all the data 
up to time t (t = 1, . . . >T). We can write the above characteristic equation in matrix form as 

/ h x \ ( Ri ■■ 



h 



\h T ) 



\ / * \ 



V 



Rt ) 



Rg 



\9T J 



and then, noting that 

hl + '' T + ^ = kl,---AYh = hRi,...,Rr)g = ±J2 R '* 



i=l 



the estimate hr is obtained by hr = (hi + • • • + hr)/T by substituting Ri by r^r-. This leads to 



h 



1 * 

t = T^ r ^ 



(23) 



i=i 



which is the incremental average of r^gi, where accounts for the contribution to the estimate of 
Ri at point i. 

Observing that g t = h t /\\h t \\, an obvious choice is to estimate g t as h t -i/\\h t -i\\; in this setting, 
/io is initialized by equating it to r\ , the first direction of data spread. After plugging in this estimator 
in d23l . we obtain 



E 



TiTi 



i h 



i-l 



(24) 



- i= i \\hi-i\\ 

In a on-line setting, we need a recursive expression for h t . Equation (1241 can be rearranged to 
obtain an equivalent expression that only uses ht-i and the most recent data point rt, 

t-l 

t 



ht 



1 ^ , hi-i 1 , ht-i 
- > nr^— — r t r+— 

tZ ^ ll^-ill 1 ll^-il 



t-l t 1 , ht-i 

——ht-i + -r t r t -f ■ 

1 1 \\ht-i\ 
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The weights (t — l)/t and l/t control the influence of old values in determining the current estimates. 
F ull details related to the computation of the subsequent eigenvectors can be found in the contribution 



of 



Weng et al. 



1200311 . 

In our application, we have used data points from 02/01/1997 till 01/11/2000 as a training set 
to obtain stable estimates of the first few dominant eigenvectors. Therefore, data points prior to 



01/1 1/2000 will be excluded from the experimental results. 



5.3 Trading rule 

The trade unit for S&P 500 Futures Index is set by the Chicago Mercantile Exchange (CME) to $250 
multiplied by the current S&P 500 Price Index, p t . Accordingly, we denote the trade unit expressed 
in monetary terms as Ct = 250 pt, which also gives the contract value at time t. For instance, if 
the current stock index price is 1400, then an investor is allowed to trade the price of the contract, 
i.e. $35000, and its multiples. In our application, we assume an initial investment of $100 million, 
denoted by w. The numbers of contracts being traded on a daily basis is given by the ratio of this 
initial endowment w to the price of the contract at time t, and is denoted by tt*. 

We call r t the set of explanatory streams. In the experimental results of Section [6l r t will either 
be the 432-dimensional column vector including the entire set of constituents (the without SVD case), 
or the reduced 3-dimensional vector of three principal components computed incrementally from the 
432 streams (the with SVD case) using the method of Section |5T2~1 

Given target and explanatory streams, respectively a t and r t , the FLS algorithm updates the current 
estimate of the artificial asset at time t. With the most updated estimate of the artificial asset, the 
current risk (i.e. the regression residual) data point is derived as 

s t = a t - r' t Pt (25) 

The current position, i.e. the suggested number of contracts to hold at the end of the current day, 
is obtained by using 

= 4>{st+i)n t 

where </>(s t+ i) is a function of the predicted risk. In our system, we deploy a simple functional 
(commonly known to practitioners as the plus-minus one rule), given by 

4>{s t+ i) = -sign(si) (26) 
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This rule implies that the risk data stream exhibits a mean-reverting behavior. The spread stream 
of Figure 0J as well as our experimental results, suggest that this assumption generally holds true. 
More formal statistical procedures could be used instead to test whether mean-reversion is satisfied 
at each time t. More realistic trading rules would also be able to detect more general patterns in the 
spread stream, and should take into consideration the uncertainty associated with the presence of such 
patterns, as well the history of previous trading decisions. 

Having obtained the number of contracts to hold, the daily order size is given by 

= $t(st) - #t-i(st) 

rounded to the nearest integer. The trading systems buys or sells daily in order to maintain the sug- 
gested number of contracts. The monetary return realized by the system at each time t is given by 

ft = 250 fa - Pt-i) #t-l(st) 

6 Experimental results 

In this section we report on experimental results obtained from the simple FLS-based trading system. 
We have tested the system using a grid of values for the smoothing parameter 5 described in Section 
[3j to understand the effect of its specification. Table Q] shows a number of financial performance 
indicators, as well as a measure of goodness of fit, with and without incremental SVD. 

The most important financial indicator is the Sharpe ratio, defined as the ratio between the average 
monetary returns and its standard deviation. It gives a measure of the mean excess return per unit of 
risk; values greater than 0.5 are considered very satisfactory, given that our strategy trades one single 
asset only. Another financial indicator reported here is the maximum drawdown, the largest movement 
from peak to bottom of the cumulative monetary return, reported as percentage. The mean square error 
(MSE) has been computed both in sample and out of sample. 

Figure [5] shows gross percentage returns over the initial endowment for the constituent set, ft/w, 
obtained using three different systems: FLS-based system with incremental SVD (using only the 
largest principal component), FLS-based system without SVD, and a buy-hold strategy. Buy-hold 
strategies are typical of asset management firms and pension funds; the investor buys a number of 
contracts and holds them throughout the investment period in question. Clearly, the FLS-based sys- 
tems outperforms the index and make a steady gross profit over time. The assumption of non existence 
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Table 1 : Experimental results obtained using the statistical arbitrage system of Section|5]on 9-years of S&P 500 Future Index. Each column contains a summary 



statistics obtained with (left-hand values) and without (right-hand values) incremental SVD. The summaries are: daily percentage gain, daily percentage loss, 
maximum drawdown in percentage, percentage of winning trades, percentage of losing trades, annualized percentage return, annualized percentage volatility of 
returns, Sharpe ratio (defined as the ratio of the two previous quantities), in-sample MSE and out-sample MSE. *To be multiplied by 10e5. 




Figure 4: Spread stream s t for a subset of the entire period. The FLS model is based on the largest principal 
component and 5 = 0.2. 

of transactions costs, although simplistic, is not particularly restrictive, as we expect that this strat- 
egy will not be dominated by cost, given that new transactions are made only daily. Moreover, we 
assume that the initial endowment remains constant throughout the back-testing period, which has an 
economic meaning that the investor/agent consumes any capital gain, as soon as is earned. 

Finally, Figure [6] shows the estimated time- varying regression coefficients of the three first prin- 
cipal components, and Figure |7] shows coefficients of three constituent assets when no SVD has been 
applied. The coefficients associated to the first component change very little over the 9 years period, 
whereas the coefficients for the two other components smoothly decrease over time, with some quite 
abrupt jumps in the initial months of 2001. As we can see from Table[U a fairly large value of S = 0.2 
gives optimal results and reinforces the merits of time- varying regression in this context. 

7 Conclusions 

We have argued that the FLS method for regression with time-varying coefficients lends itself to a 
useful temporal data mining tool. We have derived a clear connection between FLS and Kalman filter 
equations, and have demonstrated how this link enhances interpretation of the smoothing parameter 
featuring in cost function that FLS minimizes, and naturally leads to a more efficient algorithm. Fi- 
nally, we have shown how FLS can be employed as a building-block of an algorithmic trading system. 
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Figure 5: Gross profits and losses for three competing systems: FLS based on SVD (using 6 = 0.2), FLS 
based on all explanatory streams (using 8 = 0.2) and a buy-and-hold strategy. 




Figure 6: Dynamycs of FLS-estimated regression coefficients associated to the first three principal compo- 
nents, with (5 = 0.2. 
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Figure 7: Dynamycs of FLS-estimated regression coefficients associated to three constituents of the index, 
with S = 0.2. 




Figure 8: Sharpe ratio as function of S 
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There are several aspects of the simple system presented in Section[5]that can be further improved 
upon, and the remainder of this discussion points to a few general directions and related work that we 
intend to explore in the future. 

The problem of feature selection is an important one. In Section[5]the system relies on a set of 432 
constituents of the S&P 500 Price Index under the assumption that they explain well the daily move- 
ments in the target asset. These explanatory data streams could be selected automatically, perhaps even 
dynamically, from a very large basket of streams, on the basis of they similarity to the target asset. 
This line of investigation relates to the correlation detection problem for data streams, a well-studied 



Guha et al. 



|2003] propose an algorithm 



and recurrent issue in temporal data mining. For instance, 
that aims at detecting linear correlation between multiple streams. At the core of their approach is a 
technique for approximating the SVD of a large matrix by using a (random) matrix of smaller size, at a 
given accuracy level; the SVD is then periodically and randomly re-c omputed over time, a s mor e data 



points arr i ve. T he SPIRIT system for streaming pattern detection of 



Sun et al. 



Papadimitriou et al. 



[2005] and 



1 2006] incrementally finds correlations and hidden variables summarising the key trends in 
the entire stream collection. 

Of course, deciding on what similarity measure to adopt in order to measure how close explana- 
tory and target asse ts are is not an e asy task, and is indeed a much debated issue (see, for instance, 



Gavrilov et al. 



3]). 



|2000]). For instance, 



Shasha and Zhul 112004] adopt a sliding window model and the 



Euclidean distance as a measure of similarity among streams. Their StatStream system can be used 
to detect pairs of financial time series with high correlation, among many available data streams. 



Cole et al. 



1 200511 combine several techniques (random projections, grid structures, and others) in 



order to compute Pearson correlation coefficient s between data streams. O ther measures, such as 



dynamic time warping, have also been suggested BCapitani and CiacciaL 



2005]. 



Real-time feature selection can be complemented by feature extraction. In our system, for in- 
stance, we incrementally reduce the original space of 432 explanatory streams to a handful of di- 
mensions using an on-line version of SVD. Ot her dynamic dimensionality reduction models, such 



as incre mental independen t component analysis llBasalyga and Rattray , 



2004] or non-linear manifold 



2004], as well as on-line clustering methods, would offer potentially useful alter- 



learning IILaw et al.L t 
natives. 

Our simulation results have shown gross monetary results, and we have assumed that transaction 
costs are negligible. Better trading rules that explicitly model the mean-reverting behavior (or other 
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patterns) of the spread data stream and account for transaction costs, as m ICarcano et all 1120051. can 
be considered. The trading rule can also be modified so that trades are placed only when the spread 
is, in absolute value , greater than a certain threshold determined in order to maximize profits, as in 



Vidyamurthyl 11200411 . In a realistic scenario, rather than trading one asset only, the investor would 



build a portfolio of models; the resulting system may be opti mized using measures that capture both 



the forecasting and financial capabilities of the system, as in 



Towers and Burgessl 1200111 . 



Finally, we point out that the FLS method can potentially be used in other settings and 



applica- 



tions, such as predicting co-evolving data stream s with missing 



12000], and for outlier and fraud detection, as in 



Adams et al 



or de layed observations, as in 



Yi et al. 



[2006]. 
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